CS7641 - Machine Learning - Reinforcement Learning

RL1: Markov Decision Processes

Quiz

Deterministic World

Introducing Uncertainty

  1. Each step is independent, so the probability of the intended path is the product of the probabilities of the expected actions ($0.8^5 = 0.32768$).
  2. BUT, we also need to account for the scenario where the agent moves unexpectedly but still reaches the goal. This scenario is "right right up up right", where the first four moves are unintended slips (probability 0.1 each) and only the final move goes as commanded. The probability of this path is $0.1^4 \cdot 0.8 = 0.00008$.
  3. Summing the two probabilities gives us $0.32776$ (see the quick check below).
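
A quick sanity check of the arithmetic in Python (the 0.8 intended-move / 0.1 slip probabilities are the ones assumed in the quiz above):

```python
# Quick arithmetic check of the quiz answer (0.8 = intended move, 0.1 = sideways slip,
# as assumed above).
p_intended = 0.8 ** 5             # "up up right right right" all execute as commanded
p_accidental = 0.1 ** 4 * 0.8     # "right right up up right": four slips, one intended move
print(p_intended + p_accidental)  # ~0.32776
```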

Markov Decision Processes

Key Concepts:

  1. State: The possible scenarios of the world the agent can be in.
    • $s$
  2. Actions: Set of actions the agent can take based on its state.
    • $A(s)$
  3. Model: The transition function, meaning given a state and an action, what is the probability that the agent will be in the new state $s'$.
    • $T(s,a,s') \sim P(s'|s,a)$
  4. Reward: The reward (or punishment) received based on the outcome of the agent's action.
    • $R(s) \mapsto \mathbb{R}$
  5. Policy: Set of actions given for each state the agent is in. RL attempts to find the optimal policy which maximizes the reward.
    • $\Pi(s) \mapsto a$
  6. Markovian property: Only the present matters; how the system arrived at its current state has no effect on the possible future states. (A minimal sketch of these components as plain Python data follows this list.)
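
A minimal sketch of the components above as plain Python data structures. The states, numbers, and action names are made up for illustration, not taken from the lecture's grid world:

```python
# Toy MDP components (made-up values, for illustration only).
S = ["s0", "s1", "goal"]                                     # states
A = {"s0": ["right"], "s1": ["right"], "goal": []}           # A(s): actions available in s
T = {                                                        # T(s, a, s') ~ P(s' | s, a)
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "right"): {"goal": 0.8, "s1": 0.2},
}
R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}                  # R(s): reward in each state
pi = {"s0": "right", "s1": "right"}                          # pi(s) -> a (a policy)
```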

Rewards

Key Concepts:

  1. Delayed Reward: No understanding of the consequence of an immediate action. It is hard to tell which moves along the path had a meaningful impact on the overall result.
  2. Minor change, big impact: A small negative reward in all states encourages the agent to end the game. But too much negative reward may make the agent "suicidal", choosing to end the game by taking an undesired action (e.g., jumping into a bad end-game state).

Quiz: Super positive/negative rewards

  1. Not inclined to end the game, so the agent will always choose moves away from the positive end-game state.
  2. The agent is inclined to just end the game, unless the reward is imminent.
    • Note that the middle block could get to +1 (ending with a -3 reward), but why do that when it can just end the game by moving right and getting -1.

Utility

Key Concepts:

  1. Infinite Horizon: Assumption that the simulation (i.e., actions and states) continues indefinitely over time.
  2. Utility (stationary preferences): If we prefer one sequence of states over another starting now, we will still prefer it starting one step later:
$$ \text{If } U(s_0, s_1, s_2, \ldots) > U(s_0, s^{'}_1, s^{'}_2, \ldots) \\ \text{then } U(s_1, s_2, \ldots) > U(s^{'}_1, s^{'}_2, \ldots) $$
  3. The utility of a sequence of states is simply the sum of its rewards:
$$ U(s_0, s_1, s_2, \ldots) = \sum^{\infty}_{t=0} R(s_t) $$
  4. IMPORTANT: Utility considers the long-term benefit of a state or action, while rewards only consider the immediate state or action.

Discounted Rewards

Quiz: Which one is better?

  • A: Neither, because they both sum to infinity. This presents a problem with a naive sum of rewards.

Discounted Rewards: Decrease the impact of future rewards by adding a fractional factor gamma $\gamma$ to the sum of rewards.

  • Why? It gives a way to handle the problem of infinite sequences, and a way to trade-off immediate and future rewards.
  • A discount of zero makes only the immediate reward matter (future rewards are pulled to zero), while a discount of 1 weights current and future rewards equally, recovering the undiscounted sum. (See the formula below.)
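
Concretely, discounting replaces the naive sum with a geometrically weighted one, which stays bounded whenever rewards are bounded by some $R_{max}$ (standard form):

$$ U(s_0, s_1, s_2, \ldots) = \sum^{\infty}_{t=0} \gamma^{t} R(s_t) \leq \frac{R_{max}}{1-\gamma}, \qquad 0 \leq \gamma < 1 $$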

Bellman Equation

Key Concepts:

  • Bellman's Equation: "The true utility of a state is its immediate reward plus all discounted future rewards (utility)" ~ George's Notes, pg.92
  • By determining utilities, we know the optimal action to take at every state; the equation is written out below.
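
Written out in the notation from the MDP section above (standard form):

$$ U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s') $$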

Finding Policies

Value Iteration

Key Concepts:

  • "The Bellman equation is in n variables with n unknowns, but they're not linear equations." ~ George's Notes, p.92.
    • Because of the $max$, it becomes a non-linear equation.
  • Start with noise (arbitrary utilities), then keep adding truth to the equation. The truth (coming from the rewards) will, at some point, overwhelm the errors of the initial arbitrary utilities. (A small Python sketch follows.)
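
A minimal value-iteration sketch on a tiny made-up MDP (toy transition and reward numbers, not the lecture's grid world):

```python
# T[s][a] is a list of (probability, next_state) pairs; state 2 is terminal.
T = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(0.8, 2), (0.2, 1)]},
}
R = {0: -0.04, 1: -0.04, 2: 1.0}   # R(s): reward for being in state s
gamma = 0.9

# Start with arbitrary utilities ("noise"); terminal states just keep their reward.
U = {s: (0.0 if s in T else R[s]) for s in R}
for _ in range(100):               # repeat the Bellman update until it stops changing much
    U_new = dict(U)
    for s, actions in T.items():
        # Bellman update: immediate reward + best discounted expected future utility
        U_new[s] = R[s] + gamma * max(
            sum(p * U[s2] for p, s2 in outcomes) for outcomes in actions.values()
        )
    U = U_new

# Read the greedy policy off the converged utilities.
policy = {s: max(T[s], key=lambda a: sum(p * U[s2] for p, s2 in T[s][a])) for s in T}
print(U)
print(policy)
```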

Value Iteration Quiz:

(Remember to factor transition probabilities: 0.8 of moving as intended, and 0.1 of going left/right)

TODO: Revisit this quiz. I don't understand why in $U_2(s)$, $U_1(x)$ and the reward are multiplied by 0.1. (Presumably because each neighboring state's contribution is weighted by its transition probability $T(s,a,s')$: 0.8 for the intended move and 0.1 for each sideways slip.)

Video to solution: https://youtu.be/3kMLVzOK6D0

Policy Iteration

Key Concepts:

  • "We want a policy; it maps states to actions, not utilities. And though we can use utilities to find actions, it's way more work than is necessary; the optimal action would be sufficient." ~ George's Notes, p.93
  • Turns the non-linearity (due to the $max$) in Value Iteration into a system of $n$ linear equations.
  • Guaranteed to converge (finite number of policies, and each iteration yields a policy at least as good as the last). The two alternating steps are sketched below.
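
Sketch of the two alternating steps (standard formulation): with the policy $\pi_t$ fixed, the $max$ disappears and policy evaluation becomes $n$ linear equations; policy improvement then picks the greedy action under the evaluated utilities.

$$ U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s')\, U_t(s') \qquad\qquad \pi_{t+1}(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\, U_t(s') $$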

RL2: Reinforcement Learning

MDP vs Reinforcement Learning

From George's notes:

Previously, we knew everything there is to know about the world: the states, transitions, rewards, etc. In reality though, a lot of that is hidden from us: only after taking (or trying to take) an action can we see its effect on the world (ie. our new state) and the resulting reward.

Below are the 4 components to reinforcement learning:

Previously, we used model based approaches to learning. In our new use-case, we attempt to build a model-free reinforcement learner. How? By using a value-based approach that learns utilities to map states to actions.

Q-Learner

Quiz - Substituting Utility and Policy with Q

Algorithm

Quiz - V Convergence

Q-learning Convergence

From George's Notes (p.94):

The foundation behind Q-learning is using data that we learn about the world as we take actions in order to evaluate the Bellman equation. The fundamental problem, though, is that we don't have $R$ or $T$, but we do have samples about the world: $(s,a,r,s')$ is the reward and new state that result from taking an action in a state. Given enough of these tuples, then, we can estimate this Q-function.
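
In equation form (the standard Q-learning update with learning rate $\alpha$; the lecture's notation may differ slightly):

$$ \hat{Q}(s, a) \leftarrow (1 - \alpha)\, \hat{Q}(s, a) + \alpha \big( r + \gamma \max_{a'} \hat{Q}(s', a') \big) $$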

Choosing Actions (maybe randomly)

Key Concepts:

  • There are various ways of choosing actions suboptimally (always the same action; a random action; always taking the greedy $\hat{Q}$ action, which can get stuck in local optima).
  • $\epsilon$-Greedy Exploration: We balance exploration and exploitation by taking the greedy action $\arg\max_a \hat{Q}(s,a)$ with probability $1-\epsilon$ and a random action with probability $\epsilon$.
  • $\epsilon$ also decays over time, so we explore less and less as we go along. (A small Python sketch follows.)
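
A minimal $\epsilon$-greedy sketch; the Q-table, state names, and decay rate below are made-up placeholders, only the selection/decay logic is the point:

```python
import random

def epsilon_greedy(Q_hat, state, actions, epsilon):
    """Explore with probability epsilon, otherwise exploit argmax of Q_hat."""
    if random.random() < epsilon:
        return random.choice(actions)                                # explore: random action
    return max(actions, key=lambda a: Q_hat.get((state, a), 0.0))   # exploit: greedy action

Q_hat = {("s0", "left"): 0.1, ("s0", "right"): 0.5}   # hypothetical Q estimates
epsilon = 1.0
for step in range(5):
    action = epsilon_greedy(Q_hat, "s0", ["left", "right"], epsilon)
    epsilon *= 0.9                                     # decay: explore less over time
    print(step, action, round(epsilon, 3))
```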

RL3: Game Theory

Simple Game

"Two player, zero-sum finite game of perfect information with deterministic outcome"

simple_game

Game Matrix

matrix

  • Can think of each strategy of both players as their policies (in RL terms)

Minimax

minmax

  • A considers B's worst-case counter, and B likewise (A maximizes, B minimizes).
  • A wants the maximum of the minimums (maximin), and B wants the minimum of the maximums (minimax).
  • In matrix form, it doesn't matter who goes first.
  • Minimax is where the two optimal strategies intersect.
  • From the matrix, you can construct a tree that is consistent with it.

Definition: In a 2-player, zero-sum deterministic game of perfect info. $Minimax \equiv Maximin$ and there always exists an optimal pure strategy for each player.

  • Optimal is in relation to each trying to get the best result.
  • Big ifs are perfect information and pure strategies.
  • AKA von Neumann's theorem

In a zero-sum game, where the players' payoffs sum to zero, there exists a value of the game: the largest payoff one player can guarantee for themselves (maximin) equals the smallest payoff the other player can hold them to (minimax).

Game Tree (Non-deterministic)

img

  • Stochasticity occurs at leaves.
  • Just calculate the expected value using the probability of landing at leaf nodes
  • The matrix doesn't show the probabilities. In the end, we don't care about them, only the matrix matters to solve the game.
  • A wouldn't want to go left, so goes right. B wouldn't want to go right (since A gets 3) so goes left, leaving us at -2.

Takeaway: If the game is zero-sum, deterministic, with perfect information, we don't care about the details of the game. We just need the matrix of possible results and apply minimax to find the intersection of the players' choices.

Quiz: Mini-Poker (Non-deterministic, Hidden Info)

img

Mini-Poker Tree

img

What is the value of this game?

  • von Neumann (minimax) fails; there is no optimal intersection (i.e., no pure strategy).
  • We can end up at any one of these values due to the hidden information.
  • We also don't want to be consistent with one strategy, because the other player can perceive that pattern and exploit it (e.g., "He always bluffs").
  • In these cases, we start using mixed strategies, ie a distribution over strategies.

Quiz: Mixed Strategy

img

Mixed strategy: A distribution over strategies, where $P$ is the probability of being a holder.

Visualizing the Mixed Strategy

img

Significance: The two strategies overlap at $p=0.4$.

  • If A goes with a mixed strategy, where with $p=0.4$ he chooses to be a holder, the expected value of his score is +1.
  • Regardless of which probability B chooses (i.e., anywhere within the bowtie), on average A's score is still +1.
  • The maximum (of the lower envelope) can be at the far left, the center, the far right, or the lines may never cross (!). (See the general form below.)
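
In general (with $m_{i,j}$ standing in as hypothetical placeholders for the payoff-matrix entries in the image, which aren't reproduced here), A's expected payoff against each of B's pure strategies is a line in $p$, and the mixed-strategy value is the maximum of the lower envelope of those lines:

$$ V(p) = \min\big( p\, m_{1,1} + (1-p)\, m_{2,1},\;\; p\, m_{1,2} + (1-p)\, m_{2,2} \big), \qquad p^{*} = \arg\max_{p \in [0,1]} V(p) $$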

Snitch - (2-player, non-zero-sum, non-deterministic game of hidden info)

img

Significance: The matrix shows that for either player, the best payoff comes from defecting (for A: defect 0 vs. coop -1 if B coops; defect -6 vs. coop -9 if B defects). This is known as the prisoner's dilemma.

Nash Equilibrium

ChatGPT:

Nash equilibrium is a concept in game theory that describes a state of a game in which each player chooses a strategy that is optimal given the strategies chosen by all other players. In other words, in a Nash equilibrium, no player can benefit by changing their strategy while the other players keep their strategies unchanged.

Nash equilibrium is often used to predict the outcome of competitive situations where each player has a choice of different strategies, such as in business competition, auctions, and bargaining situations.

img

Nash Tips

  1. In the n-player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the Nash Equilibrium.
  2. Any N.E. will survive elimination of strictly dominated strategies.
  3. If $n$ is finite and each $S_i$ is finite for all $i$, then there exists a (possibly mixed) Nash equilibrium.
  4. Can have a "pure" or "mixed" Nash equilibrium. It can be mixed if no one wants to change their probability distributions.

Nash Equilibrium - Quiz

img

  1. Problem 1 - A will always choose the bottom row (the other row is strictly dominated).
  2. Problem 2 - We find there's a global max that benefits both (6,6), so both players will always choose that.

Repeats / Two Step

  • N.E. says that if you repeat the game $n$ times (with $n$ finite and known), you will get the same single-game Nash equilibrium in every one of the $n$ rounds.
  • Why? Even if you play multiple games and build some "trust" of cooperation (e.g., in the prisoner's dilemma), the last round is the perfect moment for the adversary to betray the other player (presuming there is no notion of guilt), and that reasoning unravels backwards to the very first round.

RL4: Game Theory, Continued

Uncertainty

img

  • Games with uncertain endings: after each round the game continues with probability $\gamma$ and ends with probability $1-\gamma$, so the expected number of rounds is $\frac{1}{1-\gamma}$.
  • The expected number of rounds goes to infinity as $\gamma$ moves toward 1. (See the sum below.)
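
Why $\frac{1}{1-\gamma}$: with continuation probability $\gamma$ per round, the expected number of rounds is the geometric series

$$ \mathbb{E}[\text{number of rounds}] = \sum^{\infty}_{t=0} \gamma^{t} = \frac{1}{1-\gamma} $$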

Tit-for-Tat (iterated prisoner's dilemma)

img

TFT Quiz

img

TFT Quiz 2

img

Best Response to a Finite-state Strategy

img

Takeaways:

  1. Finite-state strategies can be represented as a finite state machine (not a matrix; a matrix only works for a single one-shot game).
  2. Playing a best response against a finite-state machine can be solved as an MDP, where $\gamma$ is the discount factor (the probability the game continues).

IPD Quiz

img
